We have taken the liberty of grouping and rewording some of the reviewers' comments (in blue italics) to save space.

3 General answer on the usefulness of gradient descent, its theoretical guarantees, and its scalability

We thank the reviewers for the time they spent evaluating our manuscript and for their valuable comments. We agree that having theoretical guarantees would be a big plus. As for scalability, the bottleneck of our method is the single-linkage algorithm. Similarly to Monath et al. (NeurIPS 2017), our idea consists [...] Given the significant body of additional material, we feel that this topic is best left to a future publication.

Lines 8, 56, 70, 93: I would suggest a more cautious use of the word "equivalent".
LaMPE: Length-aware Multi-grained Positional Encoding for Adaptive Long-context Scaling Without Training
Sikui Zhang, Guangze Gao, Ziyun Gan, Chunfeng Yuan, Zefeng Lin, Houwen Peng, Bing Li, Weiming Hu
Large language models (LLMs) experience significant performance degradation when the input exceeds the pretraining context window, primarily due to the out-of-distribution (OOD) behavior of Rotary Position Embedding (RoPE). Recent studies mitigate this problem by remapping OOD positions into the in-distribution range with fixed mapping strategies, ignoring the dynamic relationship between input length and the model's effective context window. To this end, we propose Length-aware Multi-grained Positional Encoding (LaMPE), a training-free method that fully utilizes the model's effective context window for adaptive long-context scaling in LLMs. Motivated by the left-skewed frequency distribution of relative positions, LaMPE establishes a dynamic relationship between mapping length and input length through a parametric scaled sigmoid function to adaptively allocate positional capacity across varying input lengths. Meanwhile, LaMPE devises a novel multi-grained attention mechanism that strategically allocates positional resolution across different sequence regions to capture both fine-grained locality and long-range dependencies. Our method can be seamlessly applied to a wide range of RoPE-based LLMs without training. Extensive experiments on three representative LLMs across five mainstream long-context benchmarks demonstrate that LaMPE achieves significant performance improvements compared to existing length extrapolation methods. The code will be released at https://github.com/scar-on/LaMPE.
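The abstract's core idea, tying mapping length to input length via a parametric scaled sigmoid, can be sketched as follows. This is a minimal illustration in the spirit of the description, not LaMPE's actual formula: the parameter names and default values (`k`, `midpoint`, `floor`) are assumptions chosen only to show the shape of such a mapping.

```python
import math

def mapping_length(input_len, effective_window, k=1e-4, midpoint=16384, floor=0.25):
    """Illustrative length-aware mapping (not the paper's exact function).

    The fraction of the effective context window allocated to remapped
    positions grows smoothly and monotonically with input length:
    short inputs use a small fraction, very long inputs approach 1.
    """
    # Plain sigmoid in (0, 1), centered at `midpoint` tokens.
    s = 1.0 / (1.0 + math.exp(-k * (input_len - midpoint)))
    # Rescale into [floor, 1) so even short inputs get some capacity.
    frac = floor + (1.0 - floor) * s
    return int(round(frac * effective_window))
```

Because the output is bounded by `effective_window`, remapped positions always stay in the in-distribution range regardless of how long the input grows.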
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Asia > China > Beijing > Beijing (0.04)
Unraveling the Black-box Magic: An Analysis of Neural Networks' Dynamic Local Extrema
We point out that neural networks are not black boxes, and that their generalization stems from the ability to dynamically map a dataset to the local extrema of the model function. We further prove that the number of local extrema in a neural network is positively correlated with the number of its parameters, and on this basis we present a new algorithm, distinct from back-propagation, which we call the extremum-increment algorithm. Some difficult situations, such as vanishing gradients and overfitting, can be reasonably explained and addressed within this framework.
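The claimed correlation between parameter count and the number of local extrema can be probed numerically. The sketch below (my own illustration, not the paper's method) counts sign changes of the numerical derivative of a random one-hidden-layer tanh network on a 1-D input; the widths, weight scales, and input range are arbitrary assumptions.

```python
import numpy as np

def count_local_extrema(width, seed=0, n=20001):
    """Count local extrema of a random 1-hidden-layer tanh net on [-3, 3].

    A local extremum of the scalar function x -> y(x) is detected as a
    sign change in the finite-difference derivative. Wider networks
    (more parameters) tend to exhibit more extrema.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 4.0, size=width)  # input -> hidden weights
    b1 = rng.normal(0.0, 4.0, size=width)  # hidden biases
    w2 = rng.normal(0.0, 1.0, size=width)  # hidden -> output weights
    x = np.linspace(-3.0, 3.0, n)
    y = np.tanh(np.outer(x, w1) + b1) @ w2  # network output, shape (n,)
    dy = np.diff(y)                         # finite-difference derivative
    # Count positions where the derivative flips sign.
    return int(np.sum(np.sign(dy[:-1]) != np.sign(dy[1:])))
```

A width-1 network is a single (monotone) tanh, so it has no interior extrema, while a wide random network typically has many; this matches the abstract's correlation claim qualitatively.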